[codex] Prevent MCP tool metadata hangs on malformed responses #110

Merged
OhYee merged 2 commits into main from codex/mcp-malformed-error-timeout
Jun 2, 2026
@zoeshawwang
Collaborator

Summary

Fixes an AgentRun SDK hang where ToolResource MCP metadata loading can wait indefinitely when the MCP Python transport logs a malformed JSON-RPC response, for example an error payload with error.message = null.

Aone: https://project.aone.alibaba-inc.com/v2/project/2139638/req/82638110

Root Cause

The MCP Python streamable HTTP transport can surface malformed JSON-RPC response parsing as an Exception on the read stream. The default ClientSession handler does not route that exception back to the pending initialize or list_tools request, so SDK callers can keep awaiting forever.
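
The failure mode described above can be illustrated with a minimal sketch. It does not use the MCP client itself: a bare asyncio future stands in for the pending initialize request, and the function name and error text are hypothetical. The point is only that an awaiter never wakes if the parse error is not forwarded to it, unless the await is bounded.

```python
import asyncio


async def pending_request_demo(timeout_s: float) -> str:
    """Await a response future that nothing ever resolves."""
    loop = asyncio.get_running_loop()
    # Stands in for a pending initialize() or list_tools() request.
    response = loop.create_future()

    # The parse failure is observed (for example, logged) on another path,
    # but it is never set on `response`, so the awaiter is not woken.
    parse_error = ValueError("error.message must be a string, got null")
    _ = parse_error

    try:
        # A bare `await response` here would never return.
        return await asyncio.wait_for(response, timeout=timeout_s)
    except asyncio.TimeoutError:
        return "timed out"


result = asyncio.run(pending_request_demo(0.05))
print(result)
```

With the bound in place the caller gets a definite outcome after `timeout_s` seconds instead of waiting forever.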

Changes

  • Bound MCP metadata operations (initialize and list_tools) with a 30s timeout so agent creation cannot hang indefinitely on malformed or silent MCP responses.
  • Bound MCP tool invocation with Config.timeout so tool calls also fail instead of waiting forever.
  • Added unit coverage for metadata timeout and tool-call timeout behavior.
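
A rough sketch of how such bounding might look, assuming the 30s cap is combined with any shorter user-configured timeout. The helper names (`metadata_timeout`, `bounded`, `never_return`, `METADATA_TIMEOUT_CAP_S`) are illustrative and are not taken from the SDK source.

```python
import asyncio
from typing import Awaitable, Optional, TypeVar

T = TypeVar("T")

# 30s cap from this PR; the constant name is hypothetical.
METADATA_TIMEOUT_CAP_S = 30.0


def metadata_timeout(config_timeout: Optional[float]) -> float:
    """Pick the metadata timeout: the cap, or a shorter configured value."""
    if config_timeout is None:
        return METADATA_TIMEOUT_CAP_S
    return min(config_timeout, METADATA_TIMEOUT_CAP_S)


async def bounded(op: Awaitable[T], timeout_s: float, label: str) -> T:
    """Await `op`, converting a timeout into a TimeoutError prefixed with 'MCP '."""
    try:
        return await asyncio.wait_for(op, timeout=timeout_s)
    except asyncio.TimeoutError:
        raise TimeoutError(f"MCP {label} timed out after {timeout_s}s") from None


async def never_return() -> None:
    """Simulates a server that never answers."""
    await asyncio.Event().wait()


async def demo() -> str:
    try:
        await bounded(never_return(), 0.05, "initialize")
    except TimeoutError as exc:
        return str(exc)
    return "completed"


message = asyncio.run(demo())
print(message)
```

Raising a distinctly prefixed error makes it possible to tell the SDK's own timeouts apart from unrelated timeouts raised deeper in the transport.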

Validation

  • uv run ruff check agentrun/tool/api/mcp.py tests/unittests/tool/test_mcp.py
  • uv run pytest tests/unittests/tool/test_mcp.py -q
  • uv run pytest tests/unittests/tool -q
  • git diff --check

Notes

The MCP service should still be fixed to return a valid JSON-RPC error with a string error.message; this SDK change prevents the client-side hang while preserving that server-side requirement.
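
For reference, JSON-RPC 2.0 requires an error object to carry an integer `code` and a string `message`. The check below is a small sketch of that rule; the function name and sample payloads are made up for illustration and are not part of the SDK or the MCP client.

```python
from collections.abc import Mapping
from typing import Any


def is_valid_jsonrpc_error(response: Mapping[str, Any]) -> bool:
    """Return True if `response` carries a well-formed JSON-RPC 2.0 error object."""
    error = response.get("error")
    if not isinstance(error, Mapping):
        return False
    code = error.get("code")
    message = error.get("message")
    # bool is a subclass of int in Python, so exclude it explicitly.
    code_ok = isinstance(code, int) and not isinstance(code, bool)
    return code_ok and isinstance(message, str)


well_formed = {"jsonrpc": "2.0", "id": 1,
               "error": {"code": -32603, "message": "Internal error"}}
# The shape described in this PR: message is null instead of a string.
malformed = {"jsonrpc": "2.0", "id": 1,
             "error": {"code": -32603, "message": None}}

print(is_valid_jsonrpc_error(well_formed), is_valid_jsonrpc_error(malformed))
```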

Constraint: MCP Python client can log malformed JSON-RPC errors without waking pending initialize/list_tools awaits.

Rejected: Template-side timeout only | leaves SDK callers exposed to the same hang.

Confidence: high

Scope-risk: narrow

Directive: Keep MCP metadata operations bounded so agent creation cannot wait indefinitely on malformed server responses.

Tested: uv run ruff check agentrun/tool/api/mcp.py tests/unittests/tool/test_mcp.py; uv run pytest tests/unittests/tool/test_mcp.py -q; uv run pytest tests/unittests/tool -q; git diff --check

Not-tested: live MCP server returning malformed JSON-RPC error

Closes: coop#82638110

Change-Id: I20569d10af7ba44c140ab19e446d7fc35870f7ec

Constraint: Reproduce malformed JSON-RPC response before changing SDK behavior.

Rejected: Unit-only coverage | it did not exercise the MCP transport/context-manager path.

Confidence: high

Scope-risk: narrow

Directive: Keep malformed MCP response handling bounded at the SDK MCP boundary; services must still return valid JSON-RPC errors.

Tested: uv run pytest tests/e2e/test_mcp_malformed_response.py -q; uv run pytest tests/unittests/tool/test_mcp.py -q; uv run pytest tests/unittests/tool -q; uv run ruff check agentrun/tool/api/mcp.py tests/unittests/tool/test_mcp.py tests/e2e/test_mcp_malformed_response.py; git diff --check

Change-Id: Icde49bbfd79f29eb64acdab904f1f5df8df47bcd
Not-tested: Full e2e suite against remote AgentRun services.
Member

@OhYee OhYee left a comment


Review: PR #110 — Prevent MCP tool metadata hangs on malformed responses

Verdict: LGTM ✅

This fixes a real production issue: when the MCP Python transport encounters a malformed JSON-RPC response (for example, error.message = null), SDK callers hang indefinitely.

Change analysis

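  1. Sound timeout design

    • Metadata operations (initialize / list_tools): min(Config.timeout, 30s), since metadata loading should be fast
    • Tool invocation (call_tool): Config.timeout, or 600s by default, since tool execution can be slow
    • The intent is clear: if the user configures a shorter timeout, metadata operations should also fail sooner
  2. ExceptionGroup handling: _find_mcp_timeout_error recursively searches nested exceptions for a TimeoutError, correctly handling the ExceptionGroup wrapping that asyncio can produce. The str(exc).startswith("MCP ") prefix check is string-based, but it only matches TimeoutErrors raised by this code, so the risk is contained

  3. Adequate test coverage

    • E2E: verifies the timeout behavior against a real malformed MCP server (a FastAPI mock)
    • Unit: mocks a never_return coroutine to verify the initialize and call_tool timeouts
  4. Incidental cleanup: the get_agentrun_signed_headers import moves from the middle of the file to the top level, and the unused ToolSchema import is removed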

Minor

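The streamable and SSE branches duplicate the timeout-wrapping logic, but that reflects the existing code structure (the two transport modes are handled separately) and was not introduced by this PR.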

🤖 Reviewed by Cortex + Claude Code

@OhYee OhYee merged commit 9ebbc43 into main Jun 2, 2026
2 of 3 checks passed
@OhYee OhYee deleted the codex/mcp-malformed-error-timeout branch June 2, 2026 13:25
3 participants